knitr::opts_chunk$set(cache = TRUE)

This is working from the word doc and html file about ANOVA

1

work through the ANOVA html document

Reading in the rarefied OTU table

# going to where the file is
setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/analysis/taxasummaryplots_ANOVA/")
 # using read.delim becaue I don't know the separator
otu <- read.table(file = "otu_table_d8000.txt", # the file to read in
                  sep = "\t", # setting the field separator as a tab
                  skip = 1, # skipping the first comment line
                  comment.char = "", # the header of the file starts with a #
                  as.is = T, # read in the text as character strings
                  check.names = FALSE,
                  header = TRUE) # there are colnames in the file, so use them

Don’t use file.choose()

What kind of cols did we get?

names(otu)
##  [1] "#OTU ID"   "700033789" "700110415" "700114832" "700099121"
##  [6] "700108086" "700110349" "700102581" "700037714" "700110808"
## [11] "700038795" "700016099" "700102647" "700024001" "700106052"
## [16] "700038736" "700114649" "700024701" "700110748" "700098589"
## [21] "700107806" "700100647" "700016059" "700038938" "700099617"
## [26] "700037732" "700106916" "700038998" "700103587" "700105705"
## [31] "700016333" "700114854" "700114747" "700110689" "700016374"
## [36] "700106617" "700037702" "700014994" "700111841" "700103623"
## [41] "700102924" "700114444" "700023477" "700110746" "700095429"
## [46] "700038978" "700103517" "700099408" "700109551" "700101269"
## [51] "700097718" "700114767" "700106089" "700106563" "700024149"
## [56] "700097650" "700023313" "700107864" "700106637" "700114872"
## [61] "700097589" "700103653" "700114818" "700097313" "700014956"
## [66] "700013596" "700095235" "700101366" "700103710" "700023267"
## [71] "700038950" "700103303" "700111775" "700101335" "700016405"
## [76] "700108266" "700024097" "700095275" "taxonomy"

formatting OTU table for downstream analysis

For future work, know that the first column are the OTU ids, and the last column are taxonomy of the OTU

Finding the number of duplicated taxons. The duplicated function returns a logical vector testing to see if a certain entry is duplicated, a TRUE if it is, FALSE if it is not.

head(duplicated(otu$table)) # take a look at the beginning of the vector
## logical(0)

The cool thing with a vector of logicals, is that you can sum them, and TRUE is counted as 1, and FALSE is 0 – think binary. So we can sum that vector to see how many are duplicate

dups <- sum(duplicated(otu$taxonomy))

Note: the numbers that I am going to get will be different than yours – you should know why. If not, think about the otu table I am reading in, and the one that you are reading in.

Now, how many unique taxa?

uniq <- length(unique(otu$taxonomy))

dups and uniq should sum to be the same as the length of otu$taxonomy

dups + uniq == length(otu$taxonomy)
## [1] TRUE

Nice.

combining otus that have the same taxonomic identity

We’re going to use the aggregate function to aggregate the count data based on taxonomy of the OTU. For this we need to leave out the OTU ID, simply subset the dataframe with otu[, -1] … then also the taxonomy column, which is the last column. I’m going to use dim to find the last column by getting the second value that dim returns (dim returns the number of rows, then cols). Using the - notation to deselect rows, I need to make a vector of the two cols I don’t want. so it’s going to look like otu[ , -c(1, dim(otu)[2])]. My final aggregate statement is below.

otu.m <- aggregate( otu[, -c(1, dim(otu)[2])],
                    by = list(otu$taxonomy),
                    FUN=sum)
dim(otu.m)
## [1] 578  78

Nice.

Looking at how it changed things:

names(otu.m)
##  [1] "Group.1"   "700033789" "700110415" "700114832" "700099121"
##  [6] "700108086" "700110349" "700102581" "700037714" "700110808"
## [11] "700038795" "700016099" "700102647" "700024001" "700106052"
## [16] "700038736" "700114649" "700024701" "700110748" "700098589"
## [21] "700107806" "700100647" "700016059" "700038938" "700099617"
## [26] "700037732" "700106916" "700038998" "700103587" "700105705"
## [31] "700016333" "700114854" "700114747" "700110689" "700016374"
## [36] "700106617" "700037702" "700014994" "700111841" "700103623"
## [41] "700102924" "700114444" "700023477" "700110746" "700095429"
## [46] "700038978" "700103517" "700099408" "700109551" "700101269"
## [51] "700097718" "700114767" "700106089" "700106563" "700024149"
## [56] "700097650" "700023313" "700107864" "700106637" "700114872"
## [61] "700097589" "700103653" "700114818" "700097313" "700014956"
## [66] "700013596" "700095235" "700101366" "700103710" "700023267"
## [71] "700038950" "700103303" "700111775" "700101335" "700016405"
## [76] "700108266" "700024097" "700095275"

So … what changed??

# restoring correctness to the world
names(otu.m)[1] <- "taxonomy"

number of reads equal for all samples?

sum(colSums(otu.m[,-1]) == 8000) == dim(otu.m)[2] -1
## [1] TRUE

made an equality statement, asking if there are the same number of TRUEs as there are sample columns.

chewie relax

converting read abundances to proportions

So I am going to be making a function to turn the otu table from counts to proportions.

ConvertToProportionOTU <- function(count.otu) {
  # copying the count.otu table into prop.otu table so that I can store
  # the values into it later. 
  prop.otu <- count.otu
  
  for(i in 2:ncol(otu.m)){
  prop.otu[,i] <- 100*count.otu[,i]/sum(count.otu[,i])
  }
  
  return(prop.otu)
}
otu.prop.m <- ConvertToProportionOTU(count.otu = otu.m)

So, now checking the colsums of the pro

head(colSums(otu.prop.m[ ,-1])) 
## 700033789 700110415 700114832 700099121 700108086 700110349 
##       100       100       100       100       100       100

Filtering Rare Taxa

We are going to filter out the taxa that occurr in fewer than 4 individuals.

Breaking down what the rowSums(otu.m[,-1] > 0) statement does:

  • otu.m[,-1] > 0 asks which elements in the dataframe are greater than 0. It creates a logical matrix (all TRUE or FALSE)
  • rowSums sums each row, remember TRUE is a 1, and FALSE is a 0

then we go and see which rows have more than 4 samples that contain that OTU

# define a cutoff
cutoff <- rowSums(otu.m[,-1] > 0) > 4

# what is that doing?
cutoff
##   [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
##  [12] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [45] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
##  [56]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
##  [67] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [89] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [111] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [122] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [133]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [144]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
## [155] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [166] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [199]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [210]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [221] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [232] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [243]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [254] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [265]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [276]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [287]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [298]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [309]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [320]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [331]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [342] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [353] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [364] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [375]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [408] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [419] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [430] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## [441] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [452] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [463]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [474]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [485] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [496] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [507]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [518] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## [529] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [540]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
## [551] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [562] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## [573] FALSE  TRUE  TRUE  TRUE FALSE  TRUE
sum(cutoff==T) # how many ones are we going to keep?
## [1] 251
# filter your OTU table based on that cutoff
otu.prop.filtered <- otu.prop.m[cutoff,]

# what is this doing?
length(otu.prop.filtered[,1])==sum(cutoff==T)
## [1] TRUE

Normalization

Loading in some more packages

library(car)
# apparently the lme4 package is not needed?

Making the function that they gave us

boxCoxNormalize <- function(x){
  if(length(x)<=1){
    x;
    }
    else{
        if(sum(x<=0)>0){
            x=x-min(x)+0.0001;
        }
        p=powerTransform(x)[6]$start;
        xnorm=(x^p -1)/p
        xnorm
    }
}

Doing what they gave us … changing the names because my dataframes are special snowflakes

# Define the first column of otu.m as the variable 'taxonomy'
taxonomy = otu.prop.filtered[,1]

# boxCoxNormalize() requires a matrix with only numeric data
otu.n <- as.matrix(otu.prop.filtered[,2:ncol(otu.prop.filtered)])

# apply boxCoxNormalize() using a for loop to the new matrix
for (i in 1:nrow(otu.n)){
  otu.n[i,] = boxCoxNormalize(otu.n[i,])
}

# add 'taxonomy' back as column #1 to otu.n. Why is the as.data.frame function important here?
otu.prop.normalized <-  cbind(taxonomy, as.data.frame(otu.n))

Bringing in the metadata

It’s a party

metadata <- read.table('HMP_5BS_metadata.txt', header=T, sep='\t', check.names=F)

what variable names do we have?

names(metadata)
## [1] "SampleID"             "BarcodeSequence"      "LinkerPrimerSequence"
## [4] "Sex"                  "BodySite"             "SRS_SampleID"        
## [7] "FASTA_FILE"           "Description"

matching the metadata to the samples that we have:

# finding the samples in the metadata that match the samples we have 
metadata <- metadata[as.character(metadata$SampleID) %in%
                       colnames(otu.prop.normalized[,2:ncol(otu.prop.normalized)]),]
# finding the order that the rows match the colums
mm <- match(colnames(otu.prop.normalized[, -1]),
            as.character(metadata$SampleID))

# reordering the metadata based on matching rows
metadata <- metadata[mm,]

# checking to see if the names are the same 
as.character(metadata$SampleID) ==
  colnames(otu.prop.normalized[,2:ncol(otu.prop.normalized)])
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Defining Covariates

sex <- metadata$Sex
sex <- factor(sex,c("male","female"))
bs <- metadata$BodySite
bs <- factor(bs,c("Subgingival_plaque","Saliva","Left_Retroauricular_crease","Stool","Mid_vagina"))
ba <- metadata$Description
ba <- factor(ba,c("Urogenital_tract","Oral","Gastrointestinal_tract","Skin"))

Creating Empy vectors for p-values

pvals.sex <- vector()
pvals.bs <- vector()
pvals.ba <- vector()

Doing the ANOVA

# assign otu.m[1,] to variable taxon1
y <- as.numeric(otu.prop.normalized[8,2:ncol(otu.prop.normalized)])

# do an ANOVA on otu.m[,1] and sex, to test for significnat differences in distributions of the taxon in row #1 between males and females
aovOut <- aov(y ~ sex)

# view the output of the ANOVA using summary(aovOut)
summary(aovOut)
##             Df    Sum Sq Mean Sq F value Pr(>F)
## sex          1   5236234 5236234   2.606  0.111
## Residuals   75 150691463 2009220

Doing all this stuff with for loops

# so I renamed my out table to otu.prop.normalized
# since the rest of the code is with otu.m, I'm just going to rename my dataframe
# as otu.m
otu.m <- otu.prop.normalized

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ sex)
  pvals.sex[i] <- summary(aovOut)[[1]][1,5]
}

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ bs)
  pvals.bs[i] <- summary(aovOut)[[1]][1,5]
}

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ ba)
  pvals.ba[i] <- summary(aovOut)[[1]][1,5]
}

False Discovery rate correction

pvals.sex = p.adjust(pvals.sex, "fdr")
pvals.bs = p.adjust(pvals.bs, "fdr")
pvals.ba = p.adjust(pvals.ba, "fdr")

How many p-vals are significant

sum(pvals.sex < 0.05)
## [1] 4
sum(pvals.bs < 0.05)
## [1] 225
sum(pvals.ba < 0.05)
## [1] 224

Which OTUs are significant?

otu.m[which(pvals.sex < 0.05),1]
otu.m[which(pvals.bs < 0.05),1]
otu.m[which(pvals.ba < 0.05),1]

… this dumps a ton of output

Plotting it

par(mar=c(2,2,2,2),mgp=c(0.5,1.2,0))
par(xpd=T)
par(mfrow=c(1,1))
for(i in which(pvals.bs < 0.05)){
  y = as.numeric(otu.m[i,2:ncol(otu.m)])
  name1 = gsub('; .__',' ', as.character(otu.m[i,1]))
  name2 = gsub('k__','', name1)
  name3 = gsub('  ','', name2)
  name = strsplit(name3, " ", fixed=T)[[1]]
  names_tail = tail(name, n=2)
  boxplot(y ~ bs, main=names_tail, ylab="Relative abundance", col=c("tomato", "gray", "gold","tan4","dodgerblue"))}

2

This picks up in the ANOVA html document where the otu_tax_lev() is mentioned

Like in the Taxa summary assignment, I saved the otu_tax_lev() function to a different R script, and now I can source from that, keeping my analysis script clean. Nice.

Also I need to load plyr since the function uses it. If you get an error loading this, make sure it’s installed (install.packages("plyr"))

# going to where the script file is
setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/scripts/")
source("otu_tax_lev.R") # note, plyr gets loaded during the source of this 

Now I need to go to the dir where my data files are, since the file needs to be able to load the files (which isn’t a good practice)

setwd("~/Dropbox/gradschool/TA/BIOL2002/bioinf/analysis/taxasummaryplots_ANOVA/")
otu_TL.phylum <- otu_tax_lev('otu_table_d8000.txt', TL='phylum')
# what does the output look like?
str(otu_TL.phylum)
## List of 2
##  $ otu_table: num [1:20, 1:77] 7236 150 337 17 259 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:20] "k__Bacteria; p__Firmicutes" "k__Bacteria; p__Actinobacteria" "k__Bacteria; p__Proteobacteria" "k__Bacteria; p__Bacteroidetes" ...
##   .. ..$ : chr [1:77] "700033789" "700110415" "700114832" "700099121" ...
##  $ otuNames : chr [1:20] "k__Bacteria; p__Firmicutes" "k__Bacteria; p__Actinobacteria" "k__Bacteria; p__Proteobacteria" "k__Bacteria; p__Bacteroidetes" ...
head(otu_TL.phylum$otu_table)
##                                700033789 700110415 700114832 700099121
## k__Bacteria; p__Firmicutes          7236      7799      7835      6250
## k__Bacteria; p__Actinobacteria       150       102       115       909
## k__Bacteria; p__Proteobacteria       337        20        17       181
## k__Bacteria; p__Bacteroidetes         17        77        17       537
## k__Bacteria; p__Tenericutes          259         2         9       116
## k__Bacteria; p__Fusobacteria           1         0         7         7
##                                700108086 700110349 700102581 700037714
## k__Bacteria; p__Firmicutes          3543      7970      7988      7337
## k__Bacteria; p__Actinobacteria      4188        13         6        46
## k__Bacteria; p__Proteobacteria       224         5         4       394
## k__Bacteria; p__Bacteroidetes         16        10         2       110
## k__Bacteria; p__Tenericutes            0         0         0        92
## k__Bacteria; p__Fusobacteria           6         1         0         0
##                                700110808 700038795 700016099 700102647
## k__Bacteria; p__Firmicutes          7474      7760      6444      7732
## k__Bacteria; p__Actinobacteria       216         7       139        19
## k__Bacteria; p__Proteobacteria        14        15        80         6
## k__Bacteria; p__Bacteroidetes        270       201       891       124
## k__Bacteria; p__Tenericutes           22        15        33       119
## k__Bacteria; p__Fusobacteria           0         2       352         0
##                                700024001 700106052 700038736 700114649
## k__Bacteria; p__Firmicutes          6605      7817      7979      7531
## k__Bacteria; p__Actinobacteria        92        63         9       298
## k__Bacteria; p__Proteobacteria      1098        81         9        32
## k__Bacteria; p__Bacteroidetes        128        12         0       138
## k__Bacteria; p__Tenericutes            0         0         1         1
## k__Bacteria; p__Fusobacteria           0         1         2         0
##                                700024701 700110748 700098589 700107806
## k__Bacteria; p__Firmicutes          2446        76      3461      1218
## k__Bacteria; p__Actinobacteria      5467      7733      3731      6740
## k__Bacteria; p__Proteobacteria        33        58       583        33
## k__Bacteria; p__Bacteroidetes         14        83       190         5
## k__Bacteria; p__Tenericutes            1         0         8         0
## k__Bacteria; p__Fusobacteria           3        38        20         2
##                                700100647 700016059 700038938 700099617
## k__Bacteria; p__Firmicutes           570      3315      1638      3352
## k__Bacteria; p__Actinobacteria      7355       418      2415       138
## k__Bacteria; p__Proteobacteria        47       833       312      2877
## k__Bacteria; p__Bacteroidetes         17      2782      1754       829
## k__Bacteria; p__Tenericutes            0         7         0         3
## k__Bacteria; p__Fusobacteria          10       489      1711       327
##                                700037732 700106916 700038998 700103587
## k__Bacteria; p__Firmicutes          2396      1383      2621      4648
## k__Bacteria; p__Actinobacteria      3890      1339      3848       469
## k__Bacteria; p__Proteobacteria       695      3303       268       869
## k__Bacteria; p__Bacteroidetes        578      1731       237      1272
## k__Bacteria; p__Tenericutes            0         0         0         0
## k__Bacteria; p__Fusobacteria         385       243       264       704
##                                700105705 700016333 700114854 700114747
## k__Bacteria; p__Firmicutes          1973      1540      1879       559
## k__Bacteria; p__Actinobacteria       811      2014      2315      2723
## k__Bacteria; p__Proteobacteria       354       319       505      1358
## k__Bacteria; p__Bacteroidetes       2383      1940      1906      2124
## k__Bacteria; p__Tenericutes           16        19         0        15
## k__Bacteria; p__Fusobacteria        2252      1382      1154      1019
##                                700110689 700016374 700106617 700037702
## k__Bacteria; p__Firmicutes          1170      1355      3326      4611
## k__Bacteria; p__Actinobacteria      2023      1181       254      3247
## k__Bacteria; p__Proteobacteria       531       782       675       123
## k__Bacteria; p__Bacteroidetes       2353      2441      1940        10
## k__Bacteria; p__Tenericutes            0        23       117         1
## k__Bacteria; p__Fusobacteria        1739      1377       133         1
##                                700014994 700111841 700103623 700102924
## k__Bacteria; p__Firmicutes          1125      1926      3045      1218
## k__Bacteria; p__Actinobacteria       423        90       134         1
## k__Bacteria; p__Proteobacteria       559       440      2518       337
## k__Bacteria; p__Bacteroidetes       3075      4463      1619      6442
## k__Bacteria; p__Tenericutes           40         0         1         0
## k__Bacteria; p__Fusobacteria        2203       877       292         0
##                                700114444 700023477 700110746 700095429
## k__Bacteria; p__Firmicutes          6454      4579      1791      1848
## k__Bacteria; p__Actinobacteria      1494         8      1343         3
## k__Bacteria; p__Proteobacteria        43         6      1582       165
## k__Bacteria; p__Bacteroidetes          4      3400      2367      5901
## k__Bacteria; p__Tenericutes            0         4         0         0
## k__Bacteria; p__Fusobacteria           0         1       914         1
##                                700038978 700103517 700099408 700109551
## k__Bacteria; p__Firmicutes          2619      3657      3864      1910
## k__Bacteria; p__Actinobacteria        11       183       440      1286
## k__Bacteria; p__Proteobacteria       191      1171      2246       795
## k__Bacteria; p__Bacteroidetes       4995      2491       902      2321
## k__Bacteria; p__Tenericutes          121         9         1         0
## k__Bacteria; p__Fusobacteria           1       440       422      1685
##                                700101269 700097718 700114767 700106089
## k__Bacteria; p__Firmicutes          2160      3844      1305      7937
## k__Bacteria; p__Actinobacteria       788       265         7         3
## k__Bacteria; p__Proteobacteria      1176      1234       558        47
## k__Bacteria; p__Bacteroidetes       2465      2315      6130         9
## k__Bacteria; p__Tenericutes            0         0         0         0
## k__Bacteria; p__Fusobacteria        1391       331         0         0
##                                700106563 700024149 700097650 700023313
## k__Bacteria; p__Firmicutes          4997      1949      3012      5914
## k__Bacteria; p__Actinobacteria      2840      4315        27        29
## k__Bacteria; p__Proteobacteria       136      1616        15       161
## k__Bacteria; p__Bacteroidetes         26        97      3820      1827
## k__Bacteria; p__Tenericutes            0         0       903        24
## k__Bacteria; p__Fusobacteria           1        13         0         1
##                                700107864 700106637 700114872 700097589
## k__Bacteria; p__Firmicutes           433       496      1852      2073
## k__Bacteria; p__Actinobacteria      7555      7261      3018        12
## k__Bacteria; p__Proteobacteria         5       242        82       114
## k__Bacteria; p__Bacteroidetes          5         1      2848      5431
## k__Bacteria; p__Tenericutes            0         0       199         2
## k__Bacteria; p__Fusobacteria           2         0         0         1
##                                700103653 700114818 700097313 700014956
## k__Bacteria; p__Firmicutes          1644      1765      1953      2135
## k__Bacteria; p__Actinobacteria         3        16         2        27
## k__Bacteria; p__Proteobacteria      1368       878       372       299
## k__Bacteria; p__Bacteroidetes       4985      5341      5641      5537
## k__Bacteria; p__Tenericutes            0         0         0         0
## k__Bacteria; p__Fusobacteria           0         0         2         1
##                                700013596 700095235 700101366 700103710
## k__Bacteria; p__Firmicutes          2774      1403      1786      2144
## k__Bacteria; p__Actinobacteria        21         2        23        12
## k__Bacteria; p__Proteobacteria       133        20       551       471
## k__Bacteria; p__Bacteroidetes       4478      6381      5633      5371
## k__Bacteria; p__Tenericutes            4       188         0         0
## k__Bacteria; p__Fusobacteria           0         0         0         0
##                                700023267 700038950 700103303 700111775
## k__Bacteria; p__Firmicutes          2768       105      2849      2969
## k__Bacteria; p__Actinobacteria         1         2      2607       188
## k__Bacteria; p__Proteobacteria       202         9       344      1509
## k__Bacteria; p__Bacteroidetes       5016      7884      1342      2194
## k__Bacteria; p__Tenericutes            0         0       804         0
## k__Bacteria; p__Fusobacteria           0         0         0      1110
##                                700101335 700016405 700108266 700024097
## k__Bacteria; p__Firmicutes          1633      2177      5489      1492
## k__Bacteria; p__Actinobacteria      1700      4136       720      3730
## k__Bacteria; p__Proteobacteria       366       568       408      2768
## k__Bacteria; p__Bacteroidetes       2572       338      1191         4
## k__Bacteria; p__Tenericutes            0         0         0         0
## k__Bacteria; p__Fusobacteria        1671       769       184         3
##                                700095275
## k__Bacteria; p__Firmicutes          7771
## k__Bacteria; p__Actinobacteria        53
## k__Bacteria; p__Proteobacteria        94
## k__Bacteria; p__Bacteroidetes         76
## k__Bacteria; p__Tenericutes            6
## k__Bacteria; p__Fusobacteria           0

So, look at how the otu_table is set up. Use a function to get the dimensions

For finding the number of phyla, we probably want the number of unique phyla, so that is:

# function that finds the number of unique things ...
##  [1] "k__Bacteria; p__Firmicutes"      "k__Bacteria; p__Actinobacteria" 
##  [3] "k__Bacteria; p__Proteobacteria"  "k__Bacteria; p__Bacteroidetes"  
##  [5] "k__Bacteria; p__Tenericutes"     "k__Bacteria; p__Fusobacteria"   
##  [7] "k__Bacteria; p__Verrucomicrobia" "k__Bacteria; p__Cyanobacteria"  
##  [9] "k__Bacteria; p__Spirochaetes"    "k__Bacteria; p__TM7"            
## [11] "k__Bacteria; p__Lentisphaerae"   "k__Bacteria; p__[Thermi]"       
## [13] "k__Bacteria; p__Synergistetes"   "k__Bacteria; p__Acidobacteria"  
## [15] "k__Bacteria; p__SR1"             "k__Archaea; p__Euryarchaeota"   
## [17] "k__Bacteria; p__Fibrobacteres"   "k__Bacteria; p__Chloroflexi"    
## [19] "k__Bacteria; p__GN02"            "k__Bacteria; p__Planctomycetes"

3

filtering out taxa that occur in fewer than 5 inds … look to the earlier walkthrough on how to do this…. but then later it said at the genus level, so then you need to use otu_tax_lev() to get those with genus information

## [1] 158

Normalizing the data

# Define the first column of otu.m as the variable 'taxonomy'
taxonomy = otu.gen.filtered[,1]

# boxCoxNormalize() requires a matrix with only numeric data
otu.n <- as.matrix(otu.gen.filtered[,2:ncol(otu.gen.filtered)])

# apply boxCoxNormalize() using a for loop to the new matrix
for (i in 1:nrow(otu.n)){
  otu.n[i,] = boxCoxNormalize(otu.n[i,])
}
## Warning in estimateTransform.default(X, Y, weights, family, start,
## method, : Convergence failure: return code = 52
## Warning in estimateTransform.default(X, Y, weights, family, start,
## method, : Convergence failure: return code = 52
# add 'taxonomy' back as column #1 to otu.n. Why is the as.data.frame function important here?
otu.gen.normalized <-  cbind(taxonomy, as.data.frame(otu.n))

matching the metadata to the samples that we have:

# finding the samples in the metadata that match the samples we have 
metadata <- metadata[as.character(metadata$SampleID) %in%
                       colnames(otu.gen.normalized[,2:ncol(otu.gen.normalized)]),]
# finding the order that the rows match the colums
mm <- match(colnames(otu.gen.normalized[, -1]),
            as.character(metadata$SampleID))

# reordering the metadata based on matching rows
metadata <- metadata[mm,]

# checking to see if the names are the same 
as.character(metadata$SampleID) ==
  colnames(otu.gen.normalized[,2:ncol(otu.gen.normalized)])
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [71] TRUE TRUE TRUE TRUE TRUE TRUE

Defining Covariates

sex <- metadata$Sex
sex <- factor(sex,c("male","female"))
bs <- metadata$BodySite
bs <- factor(bs,c("Subgingival_plaque","Saliva","Left_Retroauricular_crease","Stool","Mid_vagina"))
ba <- metadata$Description
ba <- factor(ba,c("Urogenital_tract","Oral","Gastrointestinal_tract","Skin"))

Creating Empy vectors for p-values

pvals.sex <- vector()
pvals.bs <- vector()
pvals.ba <- vector()

Doing all this stuff with for loops

otu.m <- otu.gen.normalized

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ sex)
  pvals.sex[i] <- summary(aovOut)[[1]][1,5]
}

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ bs)
  pvals.bs[i] <- summary(aovOut)[[1]][1,5]
}

for(i in 1:nrow(otu.m)){
  y <- as.numeric(otu.m[i,2:ncol(otu.m)])
  aovOut <- aov(y ~ ba)
  pvals.ba[i] <- summary(aovOut)[[1]][1,5]
}

False Discovery rate correction

pvals.sex = p.adjust(pvals.sex, "fdr")
pvals.bs = p.adjust(pvals.bs, "fdr")
pvals.ba = p.adjust(pvals.ba, "fdr")

How many p-vals are significant

sum(pvals.sex < 0.05)
## [1] 3
sum(pvals.bs < 0.05)
## [1] 146
sum(pvals.ba < 0.05)
## [1] 145

Which OTUs are significant?

otu.m[which(pvals.sex < 0.05),1]
otu.m[which(pvals.bs < 0.05),1]
otu.m[which(pvals.ba < 0.05),1]

… this dumps a ton of output

Answering 3A

We can use this code to find out how many taxa differ significantly at the genus level:

sum(pvals.sex < 0.05)
sum(pvals.bs < 0.05)
sum(pvals.ba < 0.05)

So, for my dataset, there are 3 genera that differ based on sex, 146 genera that differ based on body site, and 145 genera that differ based on body area

The rest of the document

so, it’s pretty much follow the walkthrough and copy-paste commands – so go forth and have fun.